Week 10

Review from the last week

From the course website:

Guessing sex based on finish time

Split data by sex, and compute KDE for each group:
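A minimal sketch of this step, using made-up finish times (in hours) in place of the course's marathon data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up finish times in hours, one array per sex group
times_f = np.array([3.9, 4.2, 4.5, 4.8, 5.1, 4.4])
times_m = np.array([3.2, 3.6, 4.0, 4.3, 3.8, 4.6])

# One KDE per group
kde_f = gaussian_kde(times_f)
kde_m = gaussian_kde(times_m)

# Evaluate each density on a grid of finish times (e.g. for plotting)
grid = np.linspace(3, 6, 100)
density_f = kde_f(grid)
density_m = kde_m(grid)
```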

Plot density functions:

Predicting sex based on finish time

Goal: Use KDE to predict sex of a runner based on the finish time.

$$ P(\text{sex} = F | \text{ time} = t_0) = \dfrac{P( \text{ time} = t_0 | \text{ sex} = F)P(\text{ sex} = F)} {P( \text{ time} = t_0)} $$

Also: $$ P( \text{ time} = t_0) = P( \text{ time} = t_0 | \text{ sex} = F)P(\text{ sex} = F) + P( \text{ time} = t_0 | \text{ sex} = M)P(\text{ sex} = M) $$
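Putting the two formulas together, a sketch of the posterior computation (again with made-up finish times; the priors are estimated from the group sizes):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up finish times in hours for each sex group
times_f = np.array([3.9, 4.2, 4.5, 4.8, 5.1, 4.4])
times_m = np.array([3.2, 3.6, 4.0, 4.3, 3.8, 4.6])
kde_f = gaussian_kde(times_f)
kde_m = gaussian_kde(times_m)

# Priors P(sex = F), P(sex = M) estimated by group proportions
p_f = len(times_f) / (len(times_f) + len(times_m))
p_m = 1 - p_f

def prob_female(t0):
    """P(sex = F | time = t0): Bayes' formula with KDE likelihoods."""
    num = kde_f(t0)[0] * p_f
    den = num + kde_m(t0)[0] * p_m
    return num / den
```

With these toy numbers, slow finish times give a posterior above 1/2 for F and fast times a posterior below 1/2, mirroring the marathon example.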

Exercise

Bayes classifier using KDE (from the course website).

Split marathon data into training and test parts:
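One way to do the split with scikit-learn's `train_test_split`, shown here on a small made-up stand-in for the marathon DataFrame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the marathon DataFrame
marathon = pd.DataFrame({
    "time": [3.2, 3.6, 4.0, 4.3, 3.9, 4.2, 4.5, 4.8, 3.8, 5.1],
    "sex":  ["M", "M", "M", "M", "F", "F", "F", "F", "M", "F"],
})

# 75%/25% train/test split; random_state makes the split reproducible
train, test = train_test_split(marathon, test_size=0.25, random_state=0)
print(len(train), len(test))
```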

Next, we construct a predictor using training data.

Plot finish times vs predicted values:

Add a column with predictions to the test DataFrame:

Check prediction accuracy:

We can predict with high confidence that some runners are men, but no such high-confidence prediction is possible for women:

As expected, high-confidence predictions can be made for very fast runners; 75% of them finished the marathon in less than 3 hours:

Probabilities and predictions with k-NN

Compare with prediction accuracy of k-NN:

The plot below shows classification probabilities using k-NN with a given number of neighbors:
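A sketch of how such probabilities come out of scikit-learn's `KNeighborsClassifier` (made-up times and labels; with `n_neighbors=3`, `predict_proba` reports the fraction of the 3 nearest neighbors in each class):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up finish times (hours) and sex labels
X = np.array([[3.2], [3.6], [4.0], [4.3], [3.9], [4.2], [4.5], [4.8]])
y = np.array(["M", "M", "M", "M", "F", "F", "F", "F"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Class probabilities = fractions of the 3 nearest neighbors in each class
probs = knn.predict_proba([[4.4]])
print(knn.classes_, probs)
```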

Selection of bandwidth and maximum likelihood

Sidenote: 3D plots and contour plots

From the course website:

Sample function:

$$f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 - 150$$
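A sketch of evaluating this function on a grid and drawing a contour plot with matplotlib (a `plot_surface` call on a 3D axis would give the corresponding surface view):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

def f(x, y):
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2 - 150

# Grid of (x, y) points covering the region of interest
x = np.linspace(-6, 6, 200)
y = np.linspace(-6, 6, 200)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Contour plot with a colorbar keyed to function values
fig, ax = plt.subplots()
cs = ax.contour(X, Y, Z, levels=20)
fig.colorbar(cs)
```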

2-dimensional KDE

Toy example: KDE with Gaussian kernel for 2-dimensional data.

Note. scipy.stats.gaussian_kde takes as its argument data organized in a 2-dimensional array in which each column gives the coordinates of one data point. For this reason, when we use it with data coming from a DataFrame, we need to transpose the data.

Notice that cross sections of the peaks of the KDE plot below are not circular, since scipy.stats.gaussian_kde uses the covariance matrix of the data to compute bandwidth in each direction.
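A minimal 2-dimensional sketch with synthetic data (note the transpose, since the rows of `data` are points but `gaussian_kde` wants points as columns):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic 2-dimensional data: 200 points, one ROW per point
data = rng.multivariate_normal([0, 0], [[2, 1], [1, 1]], size=200)

# gaussian_kde expects one COLUMN per data point, hence the transpose
kde = gaussian_kde(data.T)

# Evaluation points are also given column-per-point: here (0,0) and (3,3)
points = np.array([[0.0, 3.0],
                   [0.0, 3.0]])
densities = kde(points)
print(densities)
```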

Back to marathon data

Linear regression and gradient descent

From the course website:

Average tip fraction:

Tip amounts vs error values:

Our goal will be to try to improve the predictions by finding values of parameters $\alpha_1, \alpha_2$ such that the function $$f(\text{total bill}) = \alpha_1\cdot (\text{total bill}) + \alpha_2$$ gives as good as possible predictions of tip amounts.

We need to specify first what we mean by "as good as possible". Let

We want to find a function $$f(x) = \alpha_1x + \alpha_2$$ such that the value

$$C(\alpha_1, \alpha_2) = \sum_i \left(y^{(i)} - f(x^{(i)})\right)^2 = \sum_i \left(y^{(i)} - \alpha_1x^{(i)} - \alpha_2\right)^2$$

is as small as possible. Note that $x^{(i)}$ and $y^{(i)}$ are known (they are coming from the data), and the parameters $\alpha_1, \alpha_2$ are unknown. The function $C(\alpha_1, \alpha_2)$ is called the cost function.

Example.

Calculate the cost function for a few different values of $\alpha_1$ and $\alpha_2$:
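A sketch of this computation on a few made-up data points $(x^{(i)}, y^{(i)})$:

```python
import numpy as np

# Made-up data points (x^(i), y^(i)), roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def cost(a1, a2):
    """C(a1, a2) = sum of squared residuals for f(x) = a1*x + a2."""
    return np.sum((y - (a1 * x + a2)) ** 2)

# Try a few parameter values; smaller cost means a better fit
for a1, a2 in [(1.0, 0.0), (2.0, 0.0), (2.0, 0.5)]:
    print(a1, a2, cost(a1, a2))
```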

Gradient descent

Gradient descent is a method of finding a (local) minimum of a function $f$.

Algorithm:

  1. Choose some starting point $\bf x$ and a number $\alpha > 0 $ (the learning rate).
  2. Calculate the gradient $\nabla f(\bf x)$ of $f$ at the point $\bf x$. At the point $\bf x$, the function decreases fastest in the direction of the vector $-\nabla f(\bf x)$.
  3. Set the new value of $\bf x$ to be ${\bf x} -\alpha\nabla f(\bf x)$.
  4. Repeat steps 2-3 some number of times.
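The steps above can be sketched in a few lines (the test function and its hand-computed gradient are made up for illustration):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient with learning rate alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Minimize f(x, y) = (x - 1)^2 + (y + 2)^2, whose gradient is
# (2(x - 1), 2(y + 2)); the minimum is at (1, -2)
grad = lambda v: np.array([2 * (v[0] - 1), 2 * (v[1] + 2)])
result = gradient_descent(grad, [0.0, 0.0])
print(result)
```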

Example 1. Gradient descent for $p(x, y) = ax^2 + by^2$.


Note. This is a good example of why, before using gradient descent to fit a regression model to data, we need to normalize the data features. In the function above, if the values of the parameters $a$ and $b$ are close to each other, gradient descent converges quickly. Convergence is much slower, though, if one of these parameters is much larger than the other:
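A small numerical illustration of this effect (learning rate and step count chosen arbitrarily): with comparable coefficients both coordinates shrink toward the minimum at the origin, while with very different coefficients the coordinate with the small coefficient barely moves.

```python
import numpy as np

def descend(a, b, alpha=0.01, steps=500):
    """Gradient descent for p(x, y) = a*x^2 + b*y^2, starting at (1, 1)."""
    v = np.array([1.0, 1.0])
    for _ in range(steps):
        v = v - alpha * np.array([2 * a * v[0], 2 * b * v[1]])
    return v

# Comparable coefficients: both coordinates shrink at a similar rate
balanced = descend(a=1, b=2)
# Very different coefficients: the x-coordinate shrinks much more slowly
skewed = descend(a=0.01, b=2)
print(balanced, skewed)
```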

From the course website:

Himmelblau function:

Rosenbrock function:

Back to linear regression and the restaurant tips

Recall:

We want to find a function $$f(x) = \alpha_1x + \alpha_2$$ such that the value

$$C(\alpha_1, \alpha_2) = \sum_i \left(y^{(i)} - f(x^{(i)})\right)^2 = \sum_i \left(y^{(i)} - \alpha_1x^{(i)} - \alpha_2\right)^2$$

is as small as possible.

Generalization:

Assume that we have a collection of data $\{(x^{(i)}, y^{(i)})\}$ where:

We want to find a function $f(x) = \sum_{j=0}^{n}\alpha_jx_j$ (with the convention $x_0 = 1$, so that $\alpha_0$ is the intercept) such that the value

$$C(\alpha_0, {\dots}, \alpha_n) = \sum_i \left(y^{(i)} - f(x^{(i)})\right)^2$$

is as small as possible.

To use the gradient descent we need to compute $\nabla C(\alpha_0, {\dots}, \alpha_n)$. Check:

$$\nabla C(\alpha_0, {\dots}, \alpha_n) = \sum_i -2\left(y^{(i)} - f(x^{(i)})\right)\cdot \left(x_0^{(i)}, {\dots}, x_n^{(i)}\right) = \sum_i 2 \left(f(x^{(i)}) - y^{(i)}\right)\cdot \left(x_0^{(i)}, \dots, x_n^{(i)}\right)$$
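A sketch of gradient descent for linear regression using this gradient formula, on made-up data (the column of ones plays the role of $x_0$, so the second parameter is the intercept; the learning rate is kept small because the features are not normalized):

```python
import numpy as np

# Made-up data, roughly following y = 0.15*x + 1
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([2.4, 4.1, 5.4, 7.2, 8.6])

# Design matrix with a column of ones, so alphas[1] is the intercept
X = np.column_stack([x, np.ones_like(x)])
alphas = np.zeros(2)

lr = 1e-4  # learning rate; must be small since x is not normalized
for _ in range(100000):
    residual = X @ alphas - y    # f(x^(i)) - y^(i) for all i at once
    grad = 2 * X.T @ residual    # the gradient formula above
    alphas = alphas - lr * grad

print(alphas)  # converges to the least-squares parameters
```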

Linear regression with sklearn

Linear regression with total_bill and size
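A sketch of fitting this model with scikit-learn; the rows below are a small made-up stand-in for the tips DataFrame:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the tips DataFrame
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "size":       [2, 3, 3, 2, 4, 4],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
})

# Fit tip ~ total_bill + size
model = LinearRegression()
model.fit(tips[["total_bill", "size"]], tips["tip"])

print(model.coef_, model.intercept_)
# score() returns R^2; here it is computed on the training data
r2 = model.score(tips[["total_bill", "size"]], tips["tip"])
print(r2)
```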

Upshot: Adding the size data does not seem to meaningfully improve the fit of the model.

Fit accuracy and $R^2$

Linear regression with total_bill and day:

Create dummy variables for the day column:
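A sketch of this step with pandas `get_dummies`, on a small made-up stand-in for the tips DataFrame: each value of the categorical day column becomes its own 0/1 indicator column.

```python
import pandas as pd

# Made-up stand-in for the tips DataFrame with a categorical day column
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "day": ["Sun", "Sun", "Sat", "Thur"],
    "tip": [1.01, 1.66, 3.50, 3.31],
})

# One indicator column per distinct day value
dummies = pd.get_dummies(tips["day"])
tips_with_dummies = pd.concat([tips, dummies], axis=1)
print(tips_with_dummies.columns.tolist())
```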

Upshot: The day data does not improve the model.